Abstract

In this paper, we study a fast approximation method for the large-scale, high-dimensional
sparse least-squares regression problem by exploiting Johnson-Lindenstrauss (JL) transforms,
which embed a set of high-dimensional vectors into a low-dimensional space. In particular, we
propose to apply a JL transform to the data matrix and the target vector and then to solve a
sparse least-squares problem on the compressed data with a slightly larger regularization
parameter. Theoretically, we establish the optimization error bound of the learned model for
two different sparsity-inducing regularizers, i.e., the elastic net and the $\ell_1$ norm.
Compared with previous relevant work, our analysis is non-asymptotic and yields more insight
into the bound, the sample complexity and the regularization. As an illustration, we also
provide an error bound for the Dantzig selector under JL transforms.


1 Introduction 


Given a data matrix $X \in \mathbb{R}^{n\times d}$ with each row representing an instance¹ and a target vector
$y = (y_1,\dots,y_n)^\top \in \mathbb{R}^n$, the sparse least-squares regression (SLSR) is to solve the following
optimization problem:

$$w_* = \arg\min_{w\in\mathbb{R}^d} \frac{1}{2n}\|Xw - y\|_2^2 + \lambda R(w) \qquad (1)$$


where $R(w)$ is a sparsity-inducing norm. In this paper, we consider two widely used sparsity-
inducing norms: (i) the $\ell_1$ norm that leads to a formulation also known as LASSO [22]; (ii) the
mixture of the $\ell_1$ and $\ell_2$ norms that leads to a formulation known as the Elastic Net ED. Although the
$\ell_1$ norm has been widely explored and studied in SLSR, the elastic net usually yields better performance
when there are highly correlated variables. Most previous studies on SLSR revolved around two
intertwined topics: sparse recovery analysis and efficient optimization algorithms. We aim to present
a fast approximation method for solving SLSR with a strong guarantee on the optimization error.


Recent years have witnessed unprecedented growth in both the scale and the dimensionality of data.
As the size of data continues to grow, solving the problem (1) is still computationally difficult
because (i) memory limitations could lead to additional costs (e.g., I/O costs, communication
costs in a distributed environment); (ii) a large number $n$ of instances or a high dimension $d$ of
features usually implies slow convergence of optimization (i.e., a large iteration complexity). In
this paper, we study a fast approximation method that employs JL transforms to reduce
the size of $X \in \mathbb{R}^{n\times d}$ and $y \in \mathbb{R}^n$. In particular, let $A \in \mathbb{R}^{m\times n}$ ($m \ll n$) denote a linear
transformation that obeys the JL lemma (cf. Lemma 1); we transform the data matrix and the target vector
into $\hat{X} = AX \in \mathbb{R}^{m\times d}$ and $\hat{y} = Ay \in \mathbb{R}^m$. Then we optimize a slightly modified SLSR problem
using the compressed data $\hat{X}$ and $\hat{y}$ to obtain an approximate solution $\hat{w}_*$. The proposed method
is supported by (i) a theoretical analysis that provides a strong guarantee of the proposed
approximation method on the optimization error of $\hat{w}_*$ in both the $\ell_2$ norm and the $\ell_1$ norm, i.e.,
$\|\hat{w}_* - w_*\|_2$ and $\|\hat{w}_* - w_*\|_1$; and (ii) empirical studies on a synthetic dataset and a real dataset.
We emphasize that besides large-scale learning, the approximation method by JL transforms can also be
used in privacy-concerned applications, which is beyond the scope of this work.

¹ $n$ is the number of instances and $d$ is the number of features.

In fact, our work is not the first that employs random reduction techniques to reduce the size of the
data for SLSR and studies the theoretical guarantee of the approximate solution. The most relevant
work is presented by Zhou, Lafferty and Wasserman [30] (referred to as Zhou's work). Below we
highlight several key differences from Zhou's work, which also emphasize our contributions:

• Our formulation on the compressed data is different from that in Zhou's work, which simply
solves the same SLSR problem using the compressed data. We introduce a slightly larger
$\ell_1$ norm regularizer, which enjoys an intuitive geometric explanation. As a result, it also
sheds light on the Dantzig selector 0 under JL transforms, a theoretical result for which
is also presented.

• Zhou's work focused on the regularized least-squares regression and the Gaussian random
projection. We consider two sparsity-inducing regularizers, namely the elastic net
and the $\ell_1$ norm. Since our analysis is based on the JL lemma, any JL transform is
applicable.

• Zhou's theoretical analysis is asymptotic, which only holds when the number of instances
$n$ approaches infinity, and it requires strong assumptions about the data matrix and other
parameters for obtaining sparsistency (i.e., the recovery of the support set) and
persistency (i.e., the generalization performance). In contrast, our analysis of the optimization
error relies on relaxed assumptions and is non-asymptotic. In particular, for the $\ell_1$ norm we
assume the standard restricted eigen-value condition in sparse recovery analysis. For the
elastic net, by exploiting the strong convexity of the regularizer we can even be exempted
from the restricted eigen-value condition, and can derive better bounds when the condition
is true.

The remainder of the paper is organized as follows. In Section 2 we review some related work. We
present the proposed method and main results in Sections 3 and 4. Numerical experiments are
presented in Section 5, followed by conclusions.

2 Related Work 

Sparse Recovery Analysis. The LASSO problem has been one of the core problems in statistics
and machine learning, which is essentially to learn a high-dimensional sparse vector $u_* \in \mathbb{R}^d$ from
(potentially noisy) linear measurements $y = Xu_* + \xi \in \mathbb{R}^n$. A rich theoretical literature [22]
[2£l ED describes the consistency, in particular the sign consistency, of various sparse regression
techniques. A stringent "irrepresentable condition" has been established to achieve sign consistency.
To circumvent the stringent assumption, several studies [11] [18] have proposed to precondition the
data matrix $X$ and/or the target vector $y$ by $PX$ and $Py$ before solving the LASSO problem, where
$P$ is usually an $n \times n$ matrix. The oracle inequalities of the solution to LASSO 0 and other sparse
estimators (e.g., the Dantzig selector 83) have also been established under restricted eigen-value
conditions of the data matrix $X$ and the Gaussian noise assumption of $\xi$. The focus in these studies
is on the case where the number of measurements $n$ is much less than the number of features, i.e., $n \ll d$.
Different from these works, we consider that both $n$ and $d$ are significantly large² and aim to derive
fast algorithms for solving the SLSR problem approximately by exploiting the JL transforms. The
recovery analysis is centered on the optimization error of the learned model with respect to the
optimal solution $w_*$ to the original SLSR problem, which together with the oracle inequality of $w_*$
automatically leads to an oracle inequality of the learned model under the Gaussian noise assumption.

Approximate Least-squares Regression. In numerical linear algebra, one important problem is
the over-constrained least-squares problem, i.e., finding a vector $w_{opt}$ such that the Euclidean norm
of the residual error $\|Xw - y\|_2$ is minimized, where the data matrix $X \in \mathbb{R}^{n\times d}$ has $n \gg d$.
The exact solver takes $O(nd^2)$ time. Several works have proposed randomized
algorithms for finding an approximate solution to the above problem in $o(nd^2)$ time [9] [8]. These works
share the same paradigm of applying an appropriate random matrix $A \in \mathbb{R}^{m\times n}$ to both $X$ and $y$ and
solving the induced subproblem, i.e., $\hat{w}_{opt} = \arg\min_{w\in\mathbb{R}^d} \|A(Xw - y)\|_2$. Relative-error bounds
for $\|y - X\hat{w}_{opt}\|_2$ and $\|w_{opt} - \hat{w}_{opt}\|_2$ have been developed. Although the proposed method uses
a similar idea to reduce the size of the data, there is a striking difference between our work and
these studies in that we consider the sparse regularized least-squares problem when both $n$ and $d$
are very large. As a consequence, the analysis and the required condition on $m$ are substantially
different. The analysis for over-constrained least-squares relies on the low rank of the data matrix
$X$, while our analysis hinges on the inherent sparsity of the optimal solution $w_*$. In terms of the
value of $m$ for accurate recovery, approximate least-squares regression requires $m = O(d \log d/\epsilon^2)$.
In contrast, for the proposed method, our analysis shows that $m$ is of order $O(s \log d/\epsilon^2)$,
where $s$ is the sparsity of the optimal solution $w_*$ to the original SLSR problem. In addition, the
proposed method can utilize any JL transform as long as it obeys the JL lemma. Therefore, our method
can benefit from recent advances in sparser JL transforms, leading to a fast transformation of the data.

² This setting has recently received increasing interest [26].

Random Projection based Learning. Random projection has been employed for addressing
the computational challenge of high-dimensional learning problems 0. In particular, let
$x_1,\dots,x_n \in \mathbb{R}^d$ denote a set of instances; by random projection we can reduce the
high-dimensional features into a low-dimensional feature space by $\hat{x}_i = Ax_i \in \mathbb{R}^m$, where $A \in \mathbb{R}^{m\times d}$
is a random projection matrix. Several works have studied some theoretical properties of learning
in the low-dimensional space. For example, m considered the following problem and its reduced
counterpart (R):

$$w_* = \arg\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2, \qquad \text{R}:\ \min_{u\in\mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell(u^\top \hat{x}_i, y_i) + \frac{\lambda}{2}\|u\|_2^2$$

Paul et al. [19] focused on SVM and showed that the margin and minimum enclosing ball in the
reduced feature space are preserved to within a small relative error provided that the data matrix
$X \in \mathbb{R}^{n\times d}$ is of low rank. Zhang et al. [27] studied the problem of recovering the original optimal
solution $w_*$ and proposed a dual recovery approach, i.e., using the learned dual variable in the
reduced feature space to recover the model in the original feature space. They also established a
recovery error under the low-rank assumption of the data matrix. Recently, the low-rank assumption
has been replaced by a sparsity assumption. Zhang et al. [28] considered the case when the optimal
solution $w_*$ is sparse, and Yang et al. [25] assumed the optimal dual solution is sparse and proposed
to solve an $\ell_1$ regularized dual formulation using the reduced data. They both established a recovery
error in the order of $O(\sqrt{s/m}\,\|w_*\|_2)$, where $s$ is the sparsity of the optimal primal solution or
the optimal dual solution. Random projection for feature reduction has also been applied to the
ridge regression problem HD. However, these methods do not apply to the SLSR problem and their
analysis is developed mainly for the squared $\ell_2$ norm regularizer. In order to maintain the sparsity
of $w$, we consider compressing the data instead of the features so that the sparse regularizer is
maintained for encouraging sparsity. Moreover, our analysis exhibits a recovery error in the order
of $O(\sqrt{s/m}\,\|e\|_2)$, where $e = Xw_* - y$, whose magnitude could be much smaller than that of $w_*$.

The JL Transforms. The JL transforms refer to a class of transforms that obey the JL lemma [12],
which states that any $N$ points in Euclidean space can be embedded into $O(\epsilon^{-2} \log N)$ dimensions
so that all pairwise Euclidean distances are preserved up to a factor of $1 \pm \epsilon$. Since the original
Johnson-Lindenstrauss result, many transforms have been designed to satisfy the JL lemma, including
Gaussian random matrices 0, sub-Gaussian random matrices [1], the randomized Hadamard transform [2],
and sparse JL transforms by random hashing [6] [13]. The analysis presented in this work builds upon the
JL lemma and therefore our method can enjoy the computational benefits of sparse JL transforms,
including less memory and fast computation.

3 A Fast Sparse Least-Squares Regression 

Notations: Let $(x_i, y_i)$, $i = 1,\dots,n$ be a set of $n$ training instances, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
We refer to $X = (x_1, x_2, \dots, x_n)^\top = (\bar{x}_1,\dots,\bar{x}_d) \in \mathbb{R}^{n\times d}$ as the data matrix and to
$y = (y_1,\dots,y_n)^\top \in \mathbb{R}^n$ as the target vector, where $\bar{x}_j$ denotes the $j$-th column of $X$. To facilitate our
analysis, let $R$ be an upper bound on the column norms, i.e., $\max_{1\le j\le d} \|\bar{x}_j\|_2 \le R$. Denote by $\|\cdot\|_1$ and
$\|\cdot\|_2$ the $\ell_1$ norm and the $\ell_2$ norm of a vector. A function $f(w): \mathbb{R}^d \to \mathbb{R}$ is $\lambda$-strongly convex with
respect to $\|\cdot\|_2$ if for all $w, u \in \mathbb{R}^d$ it satisfies $f(w) \ge f(u) + \partial f(u)^\top (w - u) + \frac{\lambda}{2}\|w - u\|_2^2$. A function
$f(w)$ is $L$-smooth with respect to $\|\cdot\|_2$ if for all $w, u \in \mathbb{R}^d$, $\|\nabla f(w) - \nabla f(u)\|_2 \le L\|w - u\|_2$, where
$\partial f(\cdot)$ and $\nabla f(\cdot)$ denote the sub-gradient and the gradient, respectively. In the analysis below for
the LASSO problem, we will use the following restricted eigen-value condition 0.


Assumption 1. For any integer $1 \le s \le d$, the matrix $X$ satisfies the restricted eigen-value
condition at the sparsity level $s$ if there exist positive constants $\phi_{\min}(s)$ and $\phi_{\max}(s)$ such that

$$\phi_{\min}(s) = \min_{w\in\mathbb{R}^d,\ 1\le\|w\|_0\le s} \frac{w^\top X^\top X w}{n\|w\|_2^2} \quad \text{and} \quad \phi_{\max}(s) = \max_{w\in\mathbb{R}^d,\ 1\le\|w\|_0\le s} \frac{w^\top X^\top X w}{n\|w\|_2^2}$$


The goal of SLSR is to learn an optimal vector $w_* = (w_{*1},\dots,w_{*d})^\top$ that minimizes the sum
of the least-squares error and a sparsity-inducing regularizer. We consider two different sparsity-
inducing regularizers: (i) the $\ell_1$ norm: $R(w) = \|w\|_1 = \sum_{i=1}^d |w_i|$; (ii) the elastic net:
$R(w) = \frac{1}{2}\|w\|_2^2 + \frac{\tau}{\lambda}\|w\|_1$. Thus, we rewrite the problem in (1) into the following form:

$$w_* = \arg\min_{w\in\mathbb{R}^d} \frac{1}{2n}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + \tau\|w\|_1 \qquad (2)$$

When $\lambda = 0$ the problem is the LASSO problem and when $\lambda > 0$ the problem is the Elastic Net
problem. Although many optimization algorithms have been developed for solving (2), they could
still suffer from high computational complexities for large-scale high-dimensional data due to (i) an
$O(nd)$ memory complexity and (ii) an $\Omega(nd)$ iteration complexity.

To alleviate the two complexities, we consider using the JL transforms to reduce the size of data,
which are discussed in more detail in subsection 3.2. In particular, we let $A \in \mathbb{R}^{m\times n}$ denote
the transformation matrix corresponding to a JL transform, then we compute the compressed data
$\hat{X} = AX \in \mathbb{R}^{m\times d}$ and $\hat{y} = Ay \in \mathbb{R}^m$, and then solve the following problem:

$$\hat{w}_* = \arg\min_{w\in\mathbb{R}^d} \frac{1}{2n}\|\hat{X}w - \hat{y}\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + (\tau + \sigma)\|w\|_1 \qquad (3)$$

where $\sigma > 0$, whose theoretical value is exhibited later. We emphasize that to obtain a bound on the
optimization error of $\hat{w}_*$, i.e., $\|\hat{w}_* - w_*\|$, it is important to increase the value of the regularization
parameter before the $\ell_1$ norm. Intuitively, after compressing the data the optimal solution may
become less sparse, hence increasing the regularization parameter can pull the solution
closer to the original optimal solution.
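
The two-step procedure above (compress, then solve (3) with the enlarged $\ell_1$ parameter $\tau + \sigma$) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: it uses a dense Gaussian JL matrix and a basic proximal gradient loop rather than the APCG solver discussed in Section 3.1, and the function names (`gaussian_jl`, `elastic_net_prox_grad`, `fast_slsr`) are hypothetical.

```python
import math
import random


def gaussian_jl(m, n, rng):
    """Dense Gaussian JL matrix with i.i.d. N(0, 1/m) entries (one possible choice of A)."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]


def matmul(A, B):
    """Plain-Python product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def soft_threshold(z, t):
    """Proximal operator of t * |.| applied to a scalar."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def elastic_net_prox_grad(C, b, n, lam, tau, iters=500):
    """Minimise (1/2n)||Cw - b||^2 + (lam/2)||w||^2 + tau*||w||_1 by proximal gradient."""
    d = len(C[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of C^T C,
    # so L is a valid (if conservative) Lipschitz constant of the smooth part.
    L = sum(c * c for row in C for c in row) / n + lam
    w = [0.0] * d
    for _ in range(iters):
        r = [p - q for p, q in zip(matvec(C, w), b)]  # residual Cw - b
        grad = [sum(C[i][j] * r[i] for i in range(len(C))) / n + lam * w[j]
                for j in range(d)]
        w = [soft_threshold(w[j] - grad[j] / L, tau / L) for j in range(d)]
    return w


def fast_slsr(X, y, m, lam, tau, sigma, seed=0):
    """Compress (X, y) with a JL matrix, then solve problem (3) with tau + sigma."""
    rng = random.Random(seed)
    n = len(X)
    A = gaussian_jl(m, n, rng)
    X_hat = matmul(A, X)  # m x d compressed data matrix
    y_hat = matvec(A, y)  # compressed target vector
    # Note the 1/(2n) normalisation uses the original n, as in (3).
    return elastic_net_prox_grad(X_hat, y_hat, n, lam, tau + sigma)
```

Calling `elastic_net_prox_grad(X, y, len(X), lam, tau)` on the uncompressed data gives a reference solution of (2) against which the output of `fast_slsr` can be compared for different values of `sigma`.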

Geometric Interpretation. We can also explain the added parameter $\sigma$ from a geometric viewpoint,
which sheds light on the theoretical value of $\sigma$ and the analysis for the Dantzig selector under JL
transforms. Without loss of generality, we consider $\lambda = 0$. Since $w_*$ is the optimal solution to the
original problem, there exists a sub-gradient $g \in \partial\|w_*\|_1$ such that $\frac{1}{n}X^\top(Xw_* - y) + \tau g = 0$.
Since $\|g\|_\infty \le 1$, $w_*$ must satisfy $\frac{1}{n}\|X^\top(Xw_* - y)\|_\infty \le \tau$, which is also the constraint
in the Dantzig selector. Similarly, the compressed problem (3) also defines a domain of the optimal
solution $\hat{w}_*$, i.e.,

$$\mathcal{D}_w = \left\{ w \in \mathbb{R}^d : \frac{1}{n}\|\hat{X}^\top(\hat{X}w - \hat{y})\|_\infty \le \tau + \sigma \right\} \qquad (4)$$

It turns out that $\sigma$ is added to ensure that the original optimal solution $w_*$ lies in $\mathcal{D}_w$ provided that
$\sigma$ is set appropriately, which can be verified as follows:

$$\begin{aligned}
\frac{1}{n}\|\hat{X}^\top(\hat{X}w_* - \hat{y})\|_\infty &= \frac{1}{n}\left\|X^\top(Xw_* - y) + \hat{X}^\top(\hat{X}w_* - \hat{y}) - X^\top(Xw_* - y)\right\|_\infty \\
&\le \frac{1}{n}\|X^\top(Xw_* - y)\|_\infty + \frac{1}{n}\left\|\hat{X}^\top(\hat{X}w_* - \hat{y}) - X^\top(Xw_* - y)\right\|_\infty \\
&\le \tau + \frac{1}{n}\|X^\top(A^\top A - I)(Xw_* - y)\|_\infty
\end{aligned}$$

Hence, if we set $\sigma \ge \frac{1}{n}\|X^\top(A^\top A - I)(Xw_* - y)\|_\infty$, it is guaranteed that $w_*$ also lies in $\mathcal{D}_w$.
Lemma 2 in subsection 3.3 provides an upper bound of $\frac{1}{n}\|X^\top(A^\top A - I)(Xw_* - y)\|_\infty$, and therefore
exhibits a theoretical value of $\sigma$. The above explanation also sheds light on the Dantzig selector
under JL transforms as presented in Section 4.
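
The chain of inequalities above rests on the identity $\hat{X}^\top(\hat{X}w - \hat{y}) = X^\top(Xw - y) + X^\top(A^\top A - I)(Xw - y)$, which holds for any $w$, not only for $w_*$. A small numerical check can make this concrete. The sketch below is illustrative only: the dimensions, the random data and the helper names (`inf_norm_of_correlation`, `dantzig_gap`) are assumptions made for the example, and a Gaussian matrix stands in for the JL transform.

```python
import math
import random


def matvec(M, v):
    return [sum(a * x for a, x in zip(row, v)) for row in M]


def inf_norm_of_correlation(X, r, n):
    """Return (1/n) * ||X^T r||_inf for a matrix X stored as a list of rows."""
    d = len(X[0])
    return max(abs(sum(X[i][j] * r[i] for i in range(len(X)))) for j in range(d)) / n


def dantzig_gap(X, y, A, w):
    """Return the three quantities appearing in the chain of inequalities for a given w."""
    n = len(X)
    r = [p - q for p, q in zip(matvec(X, w), y)]       # X w - y
    Ar = matvec(A, r)                                   # A (X w - y) = X_hat w - y_hat
    AtAr = matvec([list(c) for c in zip(*A)], Ar)       # A^T A (X w - y)
    original = inf_norm_of_correlation(X, r, n)         # (1/n)||X^T (X w - y)||_inf
    # X_hat^T (X_hat w - y_hat) equals X^T A^T A (X w - y).
    compressed = inf_norm_of_correlation(X, AtAr, n)
    # (1/n)||X^T (A^T A - I)(X w - y)||_inf, the quantity sigma has to dominate.
    q_inf = inf_norm_of_correlation(X, [a - b for a, b in zip(AtAr, r)], n)
    return original, compressed, q_inf


rng = random.Random(1)
n, d, m = 30, 5, 10  # toy sizes chosen only for the demonstration
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
y = [rng.uniform(-1, 1) for _ in range(n)]
A = [[rng.gauss(0, 1) / math.sqrt(m) for _ in range(n)] for _ in range(m)]
w = [rng.uniform(-1, 1) for _ in range(d)]
original, compressed, q_inf = dantzig_gap(X, y, A, w)
```

By the triangle inequality, `compressed` never exceeds `original + q_inf` (up to rounding), which is exactly why choosing $\sigma$ at least as large as the last quantity keeps $w_*$ feasible for (4).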


3.1 Optimization 

Before presenting the theoretical guarantee of the obtained solution $\hat{w}_*$, we compare the
optimization of the original problem (2) and the compressed problem (3). In particular, we focus on $\lambda > 0$
since the optimization of the problem with only the $\ell_1$ norm can be completed by adding the squared
$\ell_2$ norm with a small value of $\lambda$ [21].


We choose the recently proposed accelerated stochastic proximal coordinate gradient method
(APCG) [16]. The reasons are threefold: (i) it achieves an accelerated convergence for
optimizing (2), i.e., a linear convergence with a square-root dependence on the condition number; (ii) it
updates randomly selected coordinates of $w$, which is well suited for solving (3) since the
dimensionality $d$ is much larger than the equivalent number of examples $m$; (iii) it leads to a much simpler
analysis of the condition number for the compressed problem (3). First, we write the objective
functions in (2) and (3) in the following general form:

$$f(w) + \tau'\|w\|_1 = \frac{1}{2n}\|Cw - b\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + \tau'\|w\|_1 \qquad (5)$$


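where $C = (c_1,\dots,c_d) \in \mathbb{R}^{N\times d}$. For simplicity, we consider the case when each block of
coordinates corresponds to only one coordinate. The key assumption of APCG is that the function $f(w)$
should be coordinate-wise smooth. To this end, we let $e_j$ denote the $j$-th column of the identity
matrix and note that

$$\nabla f(w) = \frac{1}{n}C^\top Cw - \frac{1}{n}C^\top b + \lambda w, \qquad \nabla_j f(w) = e_j^\top \nabla f(w) = \frac{1}{n}e_j^\top C^\top Cw + \lambda w_j - \frac{1}{n}[C^\top b]_j$$

Assume $\max_{1\le j\le d} \|c_j\|_2 \le R_c$, then for any $h_j \in \mathbb{R}$, we have

$$\begin{aligned}
|\nabla_j f(w + h_j e_j) - \nabla_j f(w)| &= \left|\frac{1}{n}e_j^\top C^\top C(w + e_j h_j) - \frac{1}{n}e_j^\top C^\top Cw + \lambda h_j\right| \\
&\le \left(\frac{1}{n}|e_j^\top C^\top C e_j| + \lambda\right)|h_j| \le \left(\frac{R_c^2}{n} + \lambda\right)|h_j|
\end{aligned}$$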


Therefore $f(w)$ is coordinate-wise smooth and the smoothness parameter is $R_c^2/n + \lambda$. On the other hand,
$f(w)$ is also a $\lambda$-strongly convex function. Therefore the condition number that affects the iteration
complexity is $\kappa = \frac{R_c^2/n + \lambda}{\lambda}$ and the iteration complexity is given by

$$O\left(d\sqrt{\kappa}\log(1/\epsilon_o)\right) = O\left(d\sqrt{\frac{R_c^2 + n\lambda}{n\lambda}}\log(1/\epsilon_o)\right) = O\left(\left(d + dR_c\sqrt{\frac{1}{n\lambda}}\right)\log(1/\epsilon_o)\right)$$

where $\epsilon_o$ is an accuracy for optimization. Since the per-iteration complexity of APCG for (5) is
$O(N)$, the time complexity is given by $\tilde{O}\left(Nd + NdR_c\sqrt{\frac{1}{n\lambda}}\right)$, where $\tilde{O}$ suppresses the
logarithmic term. Next, we can analyze and compare the time complexity of optimization for (2)
and (3). For (2), $N = n$ and $R_c = R$. For (3), $N = m$, and by the JL lemma for $A$ (Lemma 1), with
a high probability $1 - \delta$ we have $R_c = \max_{1\le j\le d} \|A\bar{x}_j\|_2 \le \max_{1\le j\le d} \sqrt{1 + \epsilon_m}\,\|\bar{x}_j\|_2$, where
$\epsilon_m = O(\sqrt{\log(d/\delta)/m})$. Letting $m$ be sufficiently large, we can conclude that $R_c$ for $\hat{X}$ is $O(R)$.
Therefore, the time complexities of APCG for solving (2) and (3) are

$$(2):\ O\left(\left(nd + dR\sqrt{\frac{n}{\lambda}}\right)\log(1/\epsilon_o)\right), \qquad (3):\ O\left(\left(md + dR\frac{m}{\sqrt{n\lambda}}\right)\log(1/\epsilon_o)\right)$$

Hence, we can see that the optimization time complexity of APCG for solving (3) can be reduced
up to a factor of $1 - \frac{m}{n}$, which is substantial when $m \ll n$. The total time complexity is discussed
after we introduce the JL lemma.
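
The quantities that drive this analysis are easy to compute for a concrete matrix. The following sketch evaluates the coordinate-wise smoothness parameter $R_c^2/n + \lambda$ and the condition number $\kappa$ for the smooth part $f(w)$ of (5), and exposes the partial derivative so that the smoothness bound can be checked numerically. The helper names and the toy numbers are assumptions made for illustration; this is not an implementation of APCG.

```python
def coordinate_smoothness(C, n, lam):
    """Return R_c^2 / n + lam, where R_c is the largest Euclidean column norm of C."""
    d = len(C[0])
    R_c_sq = max(sum(row[j] ** 2 for row in C) for j in range(d))
    return R_c_sq / n + lam


def condition_number(C, n, lam):
    """kappa = (R_c^2 / n + lam) / lam, defined for lam > 0."""
    return coordinate_smoothness(C, n, lam) / lam


def partial_grad(C, b, n, lam, w, j):
    """j-th partial derivative of f(w) = (1/2n)||Cw - b||^2 + (lam/2)||w||^2."""
    r = [sum(c * x for c, x in zip(row, w)) - bi for row, bi in zip(C, b)]
    return sum(C[i][j] * r[i] for i in range(len(C))) / n + lam * w[j]


# Toy example: columns (3, 4) and (0, 1), so R_c^2 = 25.
C = [[3.0, 0.0], [4.0, 1.0]]
b = [1.0, 2.0]
smoothness = coordinate_smoothness(C, n=2, lam=0.5)  # 25/2 + 0.5
kappa = condition_number(C, n=2, lam=0.5)            # smoothness / 0.5
```

Perturbing a single coordinate of `w` by `h` changes `partial_grad` by at most `smoothness * abs(h)`, which is the coordinate-wise smoothness property used above.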


3.2 JL Transforms and Running Time 

Since the proposed method builds on the JL transforms, we present a JL lemma and mention several
JL transforms.

Lemma 1 (JL Lemma KZH?). For any integer $n > 0$, and any $0 < \epsilon, \delta < 1/2$, there exists a
probability distribution on $m \times n$ real matrices $A$ such that there exists a small universal constant
$c > 0$ and for any fixed $x \in \mathbb{R}^n$, with a probability at least $1 - \delta$, we have

$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \le c\sqrt{\frac{\log(1/\delta)}{m}}\,\|x\|_2^2 \qquad (6)$$

In other words, in order to preserve the Euclidean norm for any vector $x \in \{\bar{x}_1,\dots,\bar{x}_d\}$ within a
relative error $\epsilon$, we need to have $m = O(\epsilon^{-2} \log(d/\delta))$. Proofs of the JL lemma can be found in
many studies (e.g., ghdehshu). The value of $m$ in the JL lemma is optimal E3. In these studies,
different JL transforms $A \in \mathbb{R}^{m\times n}$ are also exhibited, including Gaussian random matrices Q:,
sub-Gaussian random matrices [1], the randomized Hadamard transform [2] and sparse JL transforms [6]
[13]. For more discussions on these JL transforms, we refer the readers to [25].
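
As a quick empirical illustration of the norm-preservation property in (6), the sketch below draws a dense Gaussian matrix with i.i.d. $N(0, 1/m)$ entries, one standard construction that satisfies the JL lemma, and measures the largest relative distortion of the squared norm over a handful of fixed vectors. The sizes, the seed and the helper names are arbitrary choices made for this example.

```python
import math
import random


def gaussian_jl(m, n, rng):
    """Dense m x n matrix with i.i.d. N(0, 1/m) entries."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]


def max_relative_distortion(A, vectors):
    """Largest | ||Ax||^2 - ||x||^2 | / ||x||^2 over the given vectors."""
    worst = 0.0
    for x in vectors:
        Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
        norm_sq = sum(xi * xi for xi in x)
        worst = max(worst, abs(sum(v * v for v in Ax) - norm_sq) / norm_sq)
    return worst


rng = random.Random(0)
n, m, d = 500, 200, 10  # compress n = 500 coordinates down to m = 200
vectors = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(d)]
distortion = max_relative_distortion(gaussian_jl(m, n, rng), vectors)
```

For a Gaussian matrix the ratio $\|Ax\|_2^2/\|x\|_2^2$ concentrates around 1 with fluctuations of order $\sqrt{1/m}$, so increasing `m` in this experiment should shrink `distortion` accordingly.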

Transformation time complexity and Total Amortizing time complexity. Among all the JL
transforms mentioned above, the transform using Gaussian random matrices is the most expensive:
it takes $O(mnd)$ time when applied to $X \in \mathbb{R}^{n\times d}$, while the randomized Hadamard
transform and sparse JL transforms can reduce it to $\tilde{O}(nd)$, where $\tilde{O}(\cdot)$ suppresses only a
logarithmic factor. Although the transformation time complexity still scales as $nd$, the computational benefit
of the JL transform can become more prominent when we consider the amortizing time complexity.
In particular, in machine learning, we usually need to tune the regularization parameters (e.g., by
cross-validation) to achieve a better generalization performance. Let $K$ denote the total number of times
of solving (2) or (3), then the amortizing time complexity is given by $time_{proc} + K \cdot time_{opt}$, where
$time_{proc}$ refers to the time of the transformation (zero for solving (2)) and $time_{opt}$ is the optimization
time. Since $time_{opt}$ for (3) is reduced significantly, the total amortizing time complexity of
the proposed method for SLSR is much reduced.
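
The amortization argument can be made concrete with a back-of-the-envelope calculation. The sketch below evaluates $time_{proc} + K \cdot time_{opt}$ and computes the number of solves after which the one-off transformation cost is paid back. The function names and the timings in the example are hypothetical placeholders, not measurements from the paper.

```python
import math


def amortized_time(time_proc, time_opt, K):
    """Total cost of one transformation followed by K optimization runs."""
    return time_proc + K * time_opt


def break_even_runs(time_proc, time_opt_original, time_opt_compressed):
    """Smallest K for which compress-then-solve is cheaper than K solves on the
    original data; returns None if the compressed solve is not faster."""
    saving = time_opt_original - time_opt_compressed
    if saving <= 0:
        return None
    # Compressed is cheaper iff time_proc + K * t_c < K * t_o, i.e. K > time_proc / saving.
    return math.floor(time_proc / saving) + 1


# Hypothetical timings (arbitrary units): transformation 30, original solve 10, compressed solve 2.
k_star = break_even_runs(30.0, 10.0, 2.0)
```

With these made-up numbers the transformation pays for itself from the fourth solve onward, which is typical of a parameter sweep over several values of the regularization parameters.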


3.3 Theoretical Guarantees 


Next, we present the theoretical guarantees on the optimization error of the obtained solution $\hat{w}_*$.
We emphasize that one can easily obtain the oracle inequalities for $\hat{w}_*$ using the optimization error
and the oracle inequalities of $w_*$ under the Gaussian noise model, which are omitted here. We
use the notation $e$ to denote $Xw_* - y$ and assume $\|e\|_2 \le \eta$. Again, we denote by $R$ the upper
bound on the norms of the column vectors of $X$, i.e., $\max_{1\le j\le d} \|\bar{x}_j\|_2 \le R$. We first present two technical lemmas.
All proofs are included in the appendix.


Lemma 2. Let $q = \frac{1}{n}X^\top(A^\top A - I)e$. With a probability at least $1 - \delta$, we have

$$\|q\|_\infty \le \frac{c\eta R}{n}\sqrt{\frac{\log(d/\delta)}{m}}$$

where $c$ is the universal constant in the JL Lemma.

Lemma 3. Let $\rho(s) = \max_{\|w\|_2\le 1,\ \|w\|_1\le\sqrt{s}} \frac{1}{n} w^\top(X^\top X - \hat{X}^\top\hat{X})w$. If $X$ satisfies the restricted
eigen-value condition as in Assumption 1, then with a probability at least $1 - \delta$, we have

$$\rho(s) \le 16c\,\phi_{\max}(s)\sqrt{\frac{\log(1/\delta) + 2s\log(36d/s)}{m}}$$

where $c$ is the universal constant in the JL lemma.


Remark: Lemma 2 is used in the analysis for Elastic Net, LASSO and the Dantzig selector. Lemma 3
is used in the analysis for LASSO and the Dantzig selector.

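Theorem 2 (Optimization Error for Elastic Net). Let $\sigma = \Theta\left(\frac{\eta R}{n}\sqrt{\frac{\log(d/\delta)}{m}}\right) \ge \frac{2c\eta R}{n}\sqrt{\frac{\log(d/\delta)}{m}}$,
where $c$ is a universal constant in the JL lemma. Let $w_*$ and $\hat{w}_*$ be the optimal solutions to (2)
and (3) for $\lambda > 0$, respectively. Then with a probability at least $1 - \delta$, for $p = 1$ or $2$ we have

$$\|\hat{w}_* - w_*\|_p \le O\left(\frac{\eta R}{n\lambda}\sqrt{\frac{s^{2/p}\log(d/\delta)}{m}}\right)$$
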

Remark: First, we can see that the value of $\sigma$ is larger than $\|q\|_\infty$ with a high probability due to
Lemma 2, which is consistent with our geometric interpretation. The upper bound of the optimization
error exhibits several interesting properties: (i) the term $\sqrt{s\log(d/\delta)/m}$ occurs commonly in
theoretical results of sparse recovery D3; (ii) the term $R/\lambda$ is related to the condition number of
the optimization problem (2), which reflects the intrinsic difficulty of optimization; and (iii) the term
$\eta/n$ is related to the empirical error of the optimal solution $w_*$. This term makes sense because
if $\eta = 0$, indicating that the optimal solution $w_*$ satisfies $Xw_* - y = 0$, then it is straightforward to
verify that $w_*$ also satisfies the optimality condition of (3) for $\sigma = 0$. Due to the uniqueness of the
optimal solution to (3), we then have $\hat{w}_* = w_*$.

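Theorem 3 (Optimization Error for LASSO). Assume $X$ satisfies the restricted eigen-value
condition in Assumption 1. Let $\sigma = \Theta\left(\frac{\eta R}{n}\sqrt{\frac{\log(d/\delta)}{m}}\right) \ge \frac{2c\eta R}{n}\sqrt{\frac{\log(d/\delta)}{m}}$, where $c$ is a universal
constant in the JL lemma. Let $w_*$ and $\hat{w}_*$ be the optimal solutions to (2) and (3) with $\lambda = 0$,
respectively, and $\Lambda = \phi_{\min}(16s) - 2\rho(16s)$. Assume $\Lambda > 0$, then with a probability at least $1 - \delta$, for
$p = 1$ or $2$ we have

$$\|\hat{w}_* - w_*\|_p \le O\left(\frac{\eta R}{n\Lambda}\sqrt{\frac{s^{2/p}\log(d/\delta)}{m}}\right)$$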

Remark: Note that $\lambda$ in Theorem 2 is replaced by $\Lambda$ in Theorem 3. In order for the result to
be valid, we must have $\Lambda > 0$, i.e., $m \ge \Omega(\kappa^2(16s)(\log(1/\delta) + 2s\log(36d/s)))$, where
$\kappa(16s) = \frac{\phi_{\max}(16s)}{\phi_{\min}(16s)}$. In addition, if the conditions in Theorem 3 hold, the result in Theorem 2 can be made
stronger by replacing $\lambda$ with $\lambda + \Lambda$.


4 Dantzig Selector under JL transforms 


In light of our geometric explanation of $\sigma$, we present the Dantzig selector under JL transforms
and its theoretical guarantee. The original Dantzig selector is the optimal solution to the following
problem:

$$w_*^D = \arg\min_{w\in\mathbb{R}^d} \|w\|_1, \quad \text{s.t.} \quad \frac{1}{n}\|X^\top(Xw - y)\|_\infty \le \tau \qquad (7)$$

Under JL transforms, we propose the following estimator

$$\hat{w}_*^D = \arg\min_{w\in\mathbb{R}^d} \|w\|_1, \quad \text{s.t.} \quad \frac{1}{n}\|\hat{X}^\top(\hat{X}w - \hat{y})\|_\infty \le \tau + \sigma \qquad (8)$$


From the previous analysis, we can show that $w_*^D$ satisfies the constraint in (8) provided that $\sigma \ge \|q\|_\infty$,
which is the key to establishing the following result.


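Theorem 4 (Optimization Error for Dantzig Selector). Assume $X$ satisfies the restricted eigen-value
condition in Assumption 1. Let $\sigma = \Theta\left(\frac{\eta R}{n}\sqrt{\frac{\log(d/\delta)}{m}}\right) \ge \frac{c\eta R}{n}\sqrt{\frac{\log(d/\delta)}{m}}$, where $c$ is a universal
constant in the JL lemma. Let $w_*^D$ and $\hat{w}_*^D$ be the optimal solutions to (7) and (8), respectively, and
$\Lambda = \phi_{\min}(4s) - \rho(4s)$. Assume $\Lambda > 0$, then with a probability at least $1 - \delta$, for $p = 1$ or $2$ we
have

$$\|\hat{w}_*^D - w_*^D\|_p \le O\left(\frac{\eta R}{n\Lambda}\sqrt{\frac{s^{2/p}\log(d/\delta)}{m}} + \frac{\tau s^{1/p}}{\Lambda}\right)$$
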


Remark: Compared to the result in Theorem 3, the definition of $\Lambda$ is slightly different, and there
is an additional term of $\frac{\tau s^{1/p}}{\Lambda}$. This additional term seems unavoidable since $\eta = 0$ does not
necessarily indicate that $w_*^D$ is also the optimal solution to (8). However, this should not be a concern if
we consider the oracle inequality of w 3) via the oracle inequality of wf, which is | wf 3 — u* || p < 

O ( 4 S ) ) under the Gaussian noise assumption and r = 0 



5 Numerical Experiments 

In this section, we present some numerical experiments to complement the theoretical results. We
conduct experiments on two datasets, a synthetic dataset and a real dataset. The synthetic data is
generated similarly to previous studies on sparse signal recovery [24]. In particular, we generate a
random matrix $X \in \mathbb{R}^{n\times d}$ with $n = 10^4$ and $d = 10^5$. The entries of the matrix $X$ are generated
independently from the uniform distribution over the interval $[-1, +1]$. A sparse vector $u_* \in \mathbb{R}^d$ is
generated with the same distribution at 100 randomly chosen coordinates. The noise $\xi \in \mathbb{R}^n$ is a
dense vector with independent random entries from the uniform distribution over the interval $[-\sigma, \sigma]$,
where $\sigma$ is the noise magnitude and is set to 0.1. We scale the data matrix $X$ such that all entries
have a variance of $1/n$ and scale the noise vector $\xi$ accordingly. Finally the vector $y$ is obtained as
$y = Xu_* + \xi$. For elastic net on the synthetic data, we try two different values of $\lambda$, $10^{-8}$ and $10^{-5}$.
The value of $\tau$ is set to $10^{-5}$ for both elastic net and lasso. Note that these values are not intended
to optimize the performance of elastic net and lasso on the synthetic data. The real data used in the
experiment is the E2006-tfidf dataset. We use the version available on the libsvm website³. There are a
total of $n = 16{,}087$ training instances, $d = 150{,}360$ features and 3,308 testing instances. We
normalize the training data such that each dimension has mean zero and variance $1/n$. The testing
data is normalized using the statistics computed on the training data. For the JL transform, we use
random hashing.

Figure 1: Optimization error of elastic net and lasso under different settings on the synthetic data.

Figure 2: Optimization or regression error of lasso under different settings on the E2006-tfidf dataset.
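
For readers unfamiliar with hashing-based JL transforms, the sketch below shows one common construction (count-sketch style): each of the $n$ rows is assigned to one of $m$ buckets with a random sign, so $A$ has exactly one $\pm 1$ entry per column and $AX$ can be formed in $O(nd)$ time without materializing $A$. This is an assumption about what "random hashing" denotes here; the exact variant used in the experiments is not specified in this section, and the function name is hypothetical.

```python
import random


def hashing_jl(X, y, m, seed=0):
    """Compress n rows into m buckets: row i is added to bucket h(i) with sign s(i).

    The result equals (A X, A y) for a sparse m x n matrix A with a single +-1
    entry per column, computed in O(n d) time.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    buckets = [rng.randrange(m) for _ in range(n)]       # h(i)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # s(i)
    X_hat = [[0.0] * d for _ in range(m)]
    y_hat = [0.0] * m
    for i in range(n):
        row, s = X_hat[buckets[i]], signs[i]
        for j in range(d):
            row[j] += s * X[i][j]
        y_hat[buckets[i]] += s * y[i]
    return X_hat, y_hat, buckets, signs
```

The returned `buckets` and `signs` are only needed to reproduce or inspect the transform; the compressed pair `(X_hat, y_hat)` is what enters problem (3).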

The experimental results on the synthetic data under different settings are shown in Figure 1. In the
left plot, we compare the optimization error for elastic net with $\lambda = 10^{-8}$ and two different values
of $m$, i.e., $m = 1000$ and $m = 2000$. The horizontal axis is the value of $\sigma$, the added regularization
parameter. We can observe that adding a slightly larger additional $\ell_1$ norm to the compressed data
problem indeed reduces the optimization error. When the value of $\sigma$ is larger than some threshold,
the error will increase, which is consistent with our theoretical results. In particular, we can see that
the threshold value for $m = 2000$ is smaller than that for $m = 1000$. In the middle plot, we compare
the optimization error for elastic net with $m = 1000$ and two different values of the regularization
parameter $\lambda$. Similar trends of the optimization error versus $\sigma$ are also observed. In addition, it is
interesting to see that the optimization error for $\lambda = 10^{-8}$ is less than that for $\lambda = 10^{-5}$, which
seems to contradict the theoretical results at first glance due to the explicit inverse dependence
on $\lambda$. However, the optimization error also depends on $\|e\|_2$, which measures the empirical error
of the corresponding optimal model. We find that with $\lambda = 10^{-8}$ we have a smaller $\|e\|_2 = 0.95$
compared to 1.34 with $\lambda = 10^{-5}$, which explains the result in the middle plot. For the right plot, we
repeat the same experiments for lasso as in the left plot for elastic net, and observe similar results.

The experimental results on the E2006-tfidf dataset for lasso are shown in Figure 2. In the left plot, we
show the root mean square error (RMSE) on the testing data of different models learned from the
original data with different values of $\tau$. In the middle and right plots, we fix the value of $\tau = 10^{-4}$,
increase the value of $\sigma$, and plot the relative optimization error and the RMSE on the testing
data. Again, the empirical results are consistent with the theoretical results and verify that with JL
transforms a larger regularizer yields a better performance.

6 Conclusions 

In this paper, we have considered a fast approximation method for sparse least-squares regression
by exploiting the JL transform. We propose a slightly different formulation on the compressed
data and interpret it from a geometric viewpoint. We also establish theoretical guarantees on
the optimization error of the obtained solution for elastic net, lasso and the Dantzig selector on the
compressed data. The theoretical results are also validated by numerical experiments on a synthetic
dataset and a real dataset.

³ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

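A Proofs of main theorems

A.1 Proof of Theorem 2

Recall the definitions

$$q = \frac{1}{n}X^\top(A^\top A - I)e, \qquad e = Xw_* - y \qquad (9)$$

First, we note that

$$\begin{aligned}
\hat{w}_* &= \arg\min_{w\in\mathbb{R}^d} \frac{1}{2n}\|\hat{X}w - \hat{y}\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + (\tau + \sigma)\|w\|_1 \\
&= \arg\min_{w\in\mathbb{R}^d} \underbrace{\frac{1}{2n}\left(w^\top\hat{X}^\top\hat{X}w - 2w^\top\hat{X}^\top\hat{y}\right) + \frac{\lambda}{2}\|w\|_2^2 + (\tau + \sigma)\|w\|_1}_{F(w)}
\end{aligned} \qquad (10)$$

and

$$w_* = \arg\min_{w\in\mathbb{R}^d} \frac{1}{2n}\|Xw - y\|_2^2 + \frac{\lambda}{2}\|w\|_2^2 + \tau\|w\|_1$$

By the optimality of $\hat{w}_*$ and the strong convexity of $F(w)$, for any $g \in \partial\|w_*\|_1$ we have

$$0 \ge F(\hat{w}_*) - F(w_*) \ge (\hat{w}_* - w_*)^\top\left(\frac{1}{n}\hat{X}^\top\hat{X}w_* - \frac{1}{n}\hat{X}^\top\hat{y} + \lambda w_*\right) + (\tau + \sigma)(\hat{w}_* - w_*)^\top g + \frac{\lambda}{2}\|\hat{w}_* - w_*\|_2^2 \qquad (11)$$

By the optimality condition of $w_*$, there exists $h \in \partial\|w_*\|_1$ such that

$$\frac{1}{n}X^\top Xw_* - \frac{1}{n}X^\top y + \lambda w_* + \tau h = 0$$

By utilizing the above equation in (11), we have

$$0 \ge (\hat{w}_* - w_*)^\top q + (\hat{w}_* - w_*)^\top[(\tau + \sigma)g - \tau h] + \frac{\lambda}{2}\|\hat{w}_* - w_*\|_2^2 \qquad (12)$$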
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Let S denote the support set of w* and S c denote its complement set. Since g could be any sub¬ 


gradient of || w||! at w*, we define g as gi = 


i £ S 


. Then we have 


sign(w*i), i £ S c 

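$$
\begin{aligned}
(\widehat{w}_* - w_*)^\top\left[(\tau + \sigma) g - \tau h\right]
&= \sum_{i \in S} (\widehat{w}_{*i} - w_{*i})(\sigma h_i) + \sum_{i \in S^c} (\widehat{w}_{*i} - w_{*i})\left(\sigma\,\mathrm{sign}(\widehat{w}_{*i}) + \tau(\mathrm{sign}(\widehat{w}_{*i}) - h_i)\right) \\
&\ge -\sigma\|[\widehat{w}_* - w_*]_S\|_1 + \sum_{i \in S^c} \sigma\,\mathrm{sign}(\widehat{w}_{*i})\,\widehat{w}_{*i} + \sum_{i \in S^c} \tau(\mathrm{sign}(\widehat{w}_{*i}) - h_i)\,\widehat{w}_{*i} \\
&\ge -\sigma\|[\widehat{w}_* - w_*]_S\|_1 + \sigma\|[\widehat{w}_*]_{S^c}\|_1,
\end{aligned}
$$

where the last inequality uses $|h_i| \le 1$ and $\sum_{i \in S^c} (\mathrm{sign}(\widehat{w}_{*i}) - h_i)\,\widehat{w}_{*i} \ge 0$. Combining the above inequality with (12), we have

$$
0 \ge -\|\widehat{w}_* - w_*\|_1 \|q\|_\infty - \sigma\|[\widehat{w}_* - w_*]_S\|_1 + \sigma\|[\widehat{w}_*]_{S^c}\|_1 + \frac{\lambda}{2}\|\widehat{w}_* - w_*\|_2^2.
$$

By splitting $\|\widehat{w}_* - w_*\|_1 = \|[\widehat{w}_* - w_*]_S\|_1 + \|[\widehat{w}_* - w_*]_{S^c}\|_1$, noting that $[w_*]_{S^c} = 0$, and reorganizing the above inequality, we have

$$
\frac{\lambda}{2}\|\widehat{w}_* - w_*\|_2^2 + (\sigma - \|q\|_\infty)\|[\widehat{w}_*]_{S^c}\|_1 \le (\sigma + \|q\|_\infty)\|[\widehat{w}_* - w_*]_S\|_1.
$$

If $\sigma \ge 2\|q\|_\infty$, then we have

$$
\frac{\lambda}{2}\|\widehat{w}_* - w_*\|_2^2 \le \frac{3\sigma}{2}\|[\widehat{w}_* - w_*]_S\|_1, \tag{13}
$$

$$
\|[\widehat{w}_*]_{S^c}\|_1 \le 3\|[\widehat{w}_* - w_*]_S\|_1. \tag{14}
$$

Note that the inequality (14) holds regardless of the value of $\lambda$. Since $\|[\widehat{w}_* - w_*]_S\|_1 \le \sqrt{s}\,\|[\widehat{w}_* - w_*]_S\|_2$ and $\|\widehat{w}_* - w_*\|_2 \ge \max\left(\|[\widehat{w}_* - w_*]_S\|_2, \|[\widehat{w}_*]_{S^c}\|_2\right)$, by combining the above inequalities with (13), we get

$$
\|\widehat{w}_* - w_*\|_2 \le \frac{3\sigma}{\lambda}\sqrt{s}, \qquad \|[\widehat{w}_* - w_*]_S\|_1 \le \frac{3\sigma}{\lambda}s
$$

and

$$
\|\widehat{w}_* - w_*\|_1 \le \|[\widehat{w}_*]_{S^c}\|_1 + \|[\widehat{w}_* - w_*]_S\|_1 \le 3\|[\widehat{w}_* - w_*]_S\|_1 + \|[\widehat{w}_* - w_*]_S\|_1 \le \frac{12\sigma}{\lambda}s.
$$

We can then complete the proof of Theorem 2 by noting the upper bound of $\|q\|_\infty$ in Lemma 2 and by setting $\sigma$ according to the Theorem.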
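The argument above is deterministic once $\sigma \ge 2\|q\|_\infty$, so it can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a Gaussian random matrix as the JL transform, uses a plain proximal-gradient (ISTA) solver chosen here for simplicity, and all problem sizes and parameter values are hypothetical. It computes $q$ from (9), sets $\sigma = 2\|q\|_\infty$, and compares the errors with the $\ell_2$ and $\ell_1$ bounds derived in the proof.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (entrywise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def elastic_net_ista(X, y, n, lam, l1, n_iter=5000):
    """Proximal gradient on ||Xw - y||^2/(2n) + (lam/2)||w||^2 + l1*||w||_1.

    n is passed explicitly because the compressed problem keeps the
    original 1/(2n) scaling even though AX has only m rows.
    """
    G = X.T @ X / n
    b = X.T @ y / n
    step = 1.0 / (np.linalg.eigvalsh(G)[-1] + lam)  # 1 / Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w - step * (G @ w - b + lam * w), step * l1)
    return w


# Hypothetical synthetic problem (sizes and parameters are illustrative only).
rng = np.random.default_rng(0)
n, d, m = 200, 50, 100
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(n)
lam, tau = 0.1, 0.05

# Solution of the original elastic-net problem.
w_star = elastic_net_ista(X, y, n, lam, tau)

# Assumed JL transform: i.i.d. Gaussian entries scaled by 1/sqrt(m).
A = rng.standard_normal((m, n)) / np.sqrt(m)

# q and e as in (9); sigma is set to the smallest value the proof allows.
e = X @ w_star - y
q = X.T @ (A.T @ (A @ e) - e) / n
sigma = 2.0 * np.max(np.abs(q))

# Solution of the compressed problem with the enlarged l1 weight tau + sigma.
w_hat = elastic_net_ista(A @ X, A @ y, n, lam, tau + sigma)

s = int(np.count_nonzero(w_star))          # support size of w_*
err2 = np.linalg.norm(w_hat - w_star)
err1 = np.linalg.norm(w_hat - w_star, 1)
bound2 = 3.0 * sigma * np.sqrt(s) / lam    # l2 bound from the proof
bound1 = 12.0 * sigma * s / lam            # l1 bound from the proof
print(f"l2 error {err2:.4f} vs bound {bound2:.4f}")
print(f"l1 error {err1:.4f} vs bound {bound1:.4f}")
```

In practice $w_*$ is unknown, so $q$ cannot be computed this way; the theorem instead sets $\sigma$ from the high-probability bound on $\|q\|_\infty$ in Lemma 2.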


A.2 Proof of Theorem 3 


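When $\lambda = 0$, the reduced problem becomes

$$
\widehat{w}_* = \arg\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{2n}\|\widehat{X} w - \widehat{y}\|_2^2 + (\tau + \sigma)\|w\|_1}_{F(w)}. \tag{15}
$$

From the proof of Theorem 2, we have

$$
\|[\widehat{w}_*]_{S^c}\|_1 \le 3\|[\widehat{w}_* - w_*]_S\|_1, \quad \text{and} \quad \frac{\|\widehat{w}_* - w_*\|_1}{\|\widehat{w}_* - w_*\|_2} \le \frac{4\|[\widehat{w}_* - w_*]_S\|_1}{\|\widehat{w}_* - w_*\|_2} \le 4\sqrt{s}.
$$

Then we have the following lemma, whose proof is deferred to the next section.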
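Lemma 4. If $X$ satisfies the restricted eigen-value condition at sparsity level $16s$, then

$$
\phi_{\min}(16s)\|\widehat{w}_* - w_*\|_2^2 \le \frac{1}{n}(\widehat{w}_* - w_*)^\top X^\top X(\widehat{w}_* - w_*) \le 4\phi_{\max}(16s)\|\widehat{w}_* - w_*\|_2^2.
$$

Then we proceed with the proof as follows. Since $w_*$ optimizes the original problem, we have for any $g \in \partial\|\widehat{w}_*\|_1$

$$
0 \ge (w_* - \widehat{w}_*)^\top\left(\frac{1}{n} X^\top X \widehat{w}_* - \frac{1}{n} X^\top y\right) + \tau (w_* - \widehat{w}_*)^\top g + \frac{1}{2n}(w_* - \widehat{w}_*)^\top X^\top X (w_* - \widehat{w}_*).
$$

Since $\widehat{w}_*$ optimizes $F(w)$, there exists $h \in \partial\|\widehat{w}_*\|_1$ such that

$$
0 \ge (\widehat{w}_* - w_*)^\top\left(\frac{1}{n}\widehat{X}^\top \widehat{X} \widehat{w}_* - \frac{1}{n}\widehat{X}^\top \widehat{y}\right) + (\tau + \sigma)(\widehat{w}_* - w_*)^\top h.
$$

Combining the two inequalities above, we have

$$
\begin{aligned}
0 \ge{}& (\widehat{w}_* - w_*)^\top\left(\frac{1}{n}\widehat{X}^\top \widehat{X} \widehat{w}_* - \frac{1}{n}\widehat{X}^\top \widehat{y} - \frac{1}{n} X^\top X \widehat{w}_* + \frac{1}{n} X^\top y\right) + (\widehat{w}_* - w_*)^\top(\tau h + \sigma h - \tau g) \\
&+ \frac{1}{2n}(\widehat{w}_* - w_*)^\top X^\top X(\widehat{w}_* - w_*) \\
={}& (\widehat{w}_* - w_*)^\top\left(\frac{1}{n}\widehat{X}^\top \widehat{X} w_* - \frac{1}{n}\widehat{X}^\top \widehat{y} - \frac{1}{n} X^\top X w_* + \frac{1}{n} X^\top y\right) + (\widehat{w}_* - w_*)^\top(\tau h + \sigma h - \tau g) \\
&+ \frac{1}{2n}(\widehat{w}_* - w_*)^\top X^\top X(\widehat{w}_* - w_*) + (\widehat{w}_* - w_*)^\top\left(\frac{1}{n}\widehat{X}^\top \widehat{X}(\widehat{w}_* - w_*) - \frac{1}{n} X^\top X(\widehat{w}_* - w_*)\right) \\
={}& (\widehat{w}_* - w_*)^\top\left(\frac{1}{n}\widehat{X}^\top \widehat{X} w_* - \frac{1}{n}\widehat{X}^\top \widehat{y} - \frac{1}{n} X^\top X w_* + \frac{1}{n} X^\top y\right) + (\widehat{w}_* - w_*)^\top(\tau h + \sigma h - \tau g) \\
&+ \frac{1}{2n}(\widehat{w}_* - w_*)^\top X^\top X(\widehat{w}_* - w_*) + (\widehat{w}_* - w_*)^\top\left(\frac{1}{n}\widehat{X}^\top \widehat{X} - \frac{1}{n} X^\top X\right)(\widehat{w}_* - w_*),
\end{aligned}
$$

where the first bracket in the last expression equals $q$ as defined in (9). By setting $g_i = h_i$, $i \in S$, and following the same analysis as in the proof of Theorem 2, we have

$$
(\widehat{w}_* - w_*)^\top(\tau h + \sigma h - \tau g) \ge -\sigma\|[\widehat{w}_* - w_*]_S\|_1 + \sigma\|[\widehat{w}_*]_{S^c}\|_1.
$$

As a result,

$$
0 \ge -\|\widehat{w}_* - w_*\|_1\|q\|_\infty - \sigma\|[\widehat{w}_* - w_*]_S\|_1 + \sigma\|[\widehat{w}_*]_{S^c}\|_1 + \frac{\phi_{\min}(16s)}{2}\|\widehat{w}_* - w_*\|_2^2 - \rho(16s)\|\widehat{w}_* - w_*\|_2^2.
$$

Then if $\sigma \ge 2\|q\|_\infty$, we arrive at the same conclusion with $\lambda$ replaced by $\phi_{\min}(16s) - 2\rho(16s)$, assuming $\phi_{\min}(16s) > 2\rho(16s)$.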


A.3 Proof of Theorem 4 

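Let $\delta = \widehat{w}_* - w_*$. First we show that

$$
\|[\delta]_{S^c}\|_1 \le \|[\delta]_S\|_1.
$$

This is because

$$
\|w_*\|_1 - \|[\delta]_S\|_1 + \|[\delta]_{S^c}\|_1 \le \|w_* + \delta\|_1 = \|\widehat{w}_*\|_1 \le \|w_*\|_1.
$$

Therefore $\|[\delta]_{S^c}\|_1 \le \|[\delta]_S\|_1$, and we have

$$
\|[\widehat{w}_*]_{S^c}\|_1 \le \|[\widehat{w}_* - w_*]_S\|_1 \quad \text{and} \quad \frac{\|\widehat{w}_* - w_*\|_1}{\|\widehat{w}_* - w_*\|_2} \le \frac{2\|[\widehat{w}_* - w_*]_S\|_1}{\|\widehat{w}_* - w_*\|_2} \le 2\sqrt{s}.
$$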
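The step $\|\widehat{w}_*\|_1 \le \|w_*\|_1$ relies on $w_*$ being feasible for the compressed problem. The following is a short sketch of why this holds. The constraint forms are an assumption here, since the Dantzig selector formulations are not restated in this appendix: we take the original constraint to be $\frac{1}{n}\|X^\top(Xw - y)\|_\infty \le \tau$, the compressed one to be $\frac{1}{n}\|\widehat{X}^\top(\widehat{X}w - \widehat{y})\|_\infty \le \tau + \sigma$, and $\sigma \ge \|q\|_\infty$. With $\widehat{X} = AX$, $\widehat{y} = Ay$, $e = Xw_* - y$ and $q = \frac{1}{n}X^\top(A^\top A - I)e$,

$$
\frac{1}{n}\left\|\widehat{X}^\top(\widehat{X} w_* - \widehat{y})\right\|_\infty
= \left\|\frac{1}{n} X^\top A^\top A e\right\|_\infty
= \left\|\frac{1}{n} X^\top e + q\right\|_\infty
\le \frac{1}{n}\left\|X^\top(X w_* - y)\right\|_\infty + \|q\|_\infty
\le \tau + \sigma.
$$

Under these assumptions $w_*$ satisfies the compressed constraint, so the minimizer $\widehat{w}_*$ of the $\ell_1$ norm over that feasible set cannot have a larger $\ell_1$ norm than $w_*$.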

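Similarly, we have the following lemma.

Lemma 5. If $X$ satisfies the restricted eigen-value condition at sparsity level $4s$, then

$$
\phi_{\min}(4s)\|\widehat{w}_* - w_*\|_2^2 \le \frac{1}{n}(\widehat{w}_* - w_*)^\top X^\top X(\widehat{w}_* - w_*) \le 4\phi_{\max}(4s)\|\widehat{w}_* - w_*\|_2^2.
$$

We continue the proof as follows, writing $\delta = \widehat{w}_* - w_*$:

$$
\frac{1}{n}\|X\delta\|_2^2 \le \frac{1}{n}\|\widehat{X}\delta\|_2^2 + \frac{1}{n}\delta^\top(X^\top X - \widehat{X}^\top \widehat{X})\delta.
$$

Since

$$
\frac{1}{n}\delta^\top \widehat{X}^\top \widehat{X}\delta \le \|\delta\|_1 \frac{1}{n}\left\|\widehat{X}^\top \widehat{X}\delta\right\|_\infty \le \|\delta\|_1 \frac{1}{n}\left\|\widehat{X}^\top(\widehat{X}\widehat{w}_* - \widehat{y}) - \widehat{X}^\top(\widehat{X} w_* - \widehat{y})\right\|_\infty \le \|\delta\|_1\, 2(\tau + \sigma),
$$

we have

$$
\begin{aligned}
\phi_{\min}(4s)\|\widehat{w}_* - w_*\|_2^2 &\le 2(\tau + \sigma)\|\widehat{w}_* - w_*\|_1 + \rho(4s)\|\widehat{w}_* - w_*\|_2^2 \\
&\le 4(\tau + \sigma)\|[\widehat{w}_* - w_*]_S\|_1 + \rho(4s)\|\widehat{w}_* - w_*\|_2^2.
\end{aligned}
$$

Then we have

$$
\|\widehat{w}_* - w_*\|_2 \le \frac{4(\tau + \sigma)\sqrt{s}}{\phi_{\min}(4s) - \rho(4s)}, \qquad \|\widehat{w}_* - w_*\|_1 \le \frac{4(\tau + \sigma)s}{\phi_{\min}(4s) - \rho(4s)}. \tag{16}
$$

We then complete the proof of Theorem 4 by noting the upper bound of $\|q\|_\infty$ and by setting $\sigma$ according to the Theorem.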


B Proofs of Lemmas 

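B.1 Proof of Lemma 2

The proof of Lemma 2 follows that of Theorem 6 in [25]. For completeness, we present the proof here. Since $X = (x_1, \ldots, x_d)$,

$$
\|q\|_\infty = \max_{1 \le j \le d} \frac{1}{n}\left|x_j^\top (I - A^\top A) e\right|.
$$

We first bound for individual $j$ and then apply the union bound. Let $\tilde{x}_j$ and $\tilde{e}$ be normalized versions of $x_j$ and $e$, i.e., $\tilde{x}_j = x_j/\|x_j\|_2$ and $\tilde{e} = e/\|e\|_2$. Let $\epsilon = c\sqrt{\log(1/\delta)/m}$. Since $A$ obeys the JL lemma, with a probability $1 - \delta$ we have

$$
\left|\|Ax\|_2^2 - \|x\|_2^2\right| \le \epsilon\|x\|_2^2.
$$

Then with a probability $1 - \delta$,

$$
\begin{aligned}
\tilde{x}_j^\top A^\top A \tilde{e} - \tilde{x}_j^\top \tilde{e}
&= \frac{\|A(\tilde{x}_j + \tilde{e})\|_2^2 - \|A(\tilde{x}_j - \tilde{e})\|_2^2}{4} - \tilde{x}_j^\top \tilde{e} \\
&\le \frac{(1 + \epsilon)\|\tilde{x}_j + \tilde{e}\|_2^2 - (1 - \epsilon)\|\tilde{x}_j - \tilde{e}\|_2^2}{4} - \tilde{x}_j^\top \tilde{e} \\
&= \frac{\epsilon}{2}\left(\|\tilde{x}_j\|_2^2 + \|\tilde{e}\|_2^2\right) \le \epsilon.
\end{aligned}
$$

Similarly with a probability $1 - \delta$,

$$
\tilde{x}_j^\top A^\top A \tilde{e} - \tilde{x}_j^\top \tilde{e} = \frac{\|A(\tilde{x}_j + \tilde{e})\|_2^2 - \|A(\tilde{x}_j - \tilde{e})\|_2^2}{4} - \tilde{x}_j^\top \tilde{e} \ge -\frac{\epsilon}{2}\left(\|\tilde{x}_j\|_2^2 + \|\tilde{e}\|_2^2\right) \ge -\epsilon.
$$

Therefore with a probability $1 - 2\delta$, we have

$$
\left|x_j^\top A^\top A e - x_j^\top e\right| \le \|x_j\|_2\|e\|_2\left|\tilde{x}_j^\top A^\top A \tilde{e} - \tilde{x}_j^\top \tilde{e}\right| \le \|x_j\|_2\|e\|_2\,\epsilon.
$$

Then applying the union bound, we complete the proof.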
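The key device in this proof is the polarization identity, which turns an inner product into a difference of squared norms so that the norm-preservation property of $A$ can be applied. The following is a small numerical sketch of that step. It assumes a Gaussian matrix as the JL transform (one standard choice, not necessarily the one used in the experiments), and the dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 200

# Assumed JL transform: i.i.d. Gaussian entries scaled by 1/sqrt(m).
A = rng.standard_normal((m, n)) / np.sqrt(m)

# Two unit vectors playing the roles of the normalized x_j and e.
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
e = rng.standard_normal(n)
e /= np.linalg.norm(e)

# Polarization identity: <Ax, Ae> = (||A(x+e)||^2 - ||A(x-e)||^2) / 4.
inner = (A @ x) @ (A @ e)
polar = (np.linalg.norm(A @ (x + e)) ** 2 - np.linalg.norm(A @ (x - e)) ** 2) / 4.0


def distortion(v):
    """Realized relative distortion | ||Av||^2 - ||v||^2 | / ||v||^2."""
    return abs(np.linalg.norm(A @ v) ** 2 - np.linalg.norm(v) ** 2) / np.linalg.norm(v) ** 2


# If both x+e and x-e are distorted by at most eps, the proof's chain of
# inequalities gives |x^T A^T A e - x^T e| <= eps for unit vectors.
eps = max(distortion(x + e), distortion(x - e))
gap = abs(inner - x @ e)
print(f"inner-product gap {gap:.4f}, realized distortion {eps:.4f}")
```

Here `eps` is the distortion actually realized on the two vectors $x + e$ and $x - e$, so the inequality `gap <= eps` holds without any probabilistic argument; the JL lemma is what guarantees that `eps` is small with high probability.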

B.2 Proof of Lemma 3 


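The proof of Lemma 3 follows the analysis in [25]. For completeness, we present the proof here. Define $\mathcal{S}_{d,s}$ and $\mathcal{K}_{d,s}$ as

$$
\mathcal{S}_{d,s} = \{u \in \mathbb{R}^d : \|u\|_2 \le 1, \|u\|_0 \le s\}, \qquad \mathcal{K}_{d,s} = \{u \in \mathbb{R}^d : \|u\|_2 \le 1, \|u\|_1 \le \sqrt{s}\}.
$$

Due to $\mathrm{conv}(\mathcal{S}_{d,s}) \subseteq \mathcal{K}_{d,s} \subseteq 2\,\mathrm{conv}(\mathcal{S}_{d,s})$ [20], for any $u \in \mathcal{K}_{d,s}$, we can write it as $u = 2\sum_i \lambda_i v_i$ where $v_i \in \mathcal{S}_{d,s}$, $\sum_i \lambda_i = 1$ and $\lambda_i \ge 0$. Then we have

$$
\begin{aligned}
\left|u^\top(X^\top X - \widehat{X}^\top \widehat{X})u\right| &= \left|(Xu)^\top(I - A^\top A)(Xu)\right| \\
&\le 4\left|\Big(X\sum_i \lambda_i v_i\Big)^\top (I - A^\top A)\Big(X\sum_j \lambda_j v_j\Big)\right| \\
&\le 4\sum_{i,j}\lambda_i\lambda_j\left|(Xv_i)^\top(I - A^\top A)(Xv_j)\right| \\
&\le 4\max_{u_1, u_2 \in \mathcal{S}_{d,s}}\left|(Xu_1)^\top(I - A^\top A)(Xu_2)\right|\sum_{i,j}\lambda_i\lambda_j = 4\max_{u_1, u_2 \in \mathcal{S}_{d,s}}\left|(Xu_1)^\top(I - A^\top A)(Xu_2)\right|.
\end{aligned}
$$

Therefore

$$
\max_{u \in \mathcal{K}_{d,s}}\left|(Xu)^\top(I - A^\top A)(Xu)\right| \le 4\max_{u_1, u_2 \in \mathcal{S}_{d,s}}\left|(Xu_1)^\top(I - A^\top A)(Xu_2)\right|. \tag{17}
$$

Following the proof of Lemma 2, for any fixed $u_1, u_2 \in \mathcal{S}_{d,s}$, with a probability $1 - 2\delta$ we have

$$
\frac{1}{n}\left|(Xu_1)^\top(I - A^\top A)(Xu_2)\right| \le \frac{1}{n}\|Xu_1\|_2\|Xu_2\|_2\, c\sqrt{\frac{\log(1/\delta)}{m}} \le \phi_{\max}(s)\, c\sqrt{\frac{\log(1/\delta)}{m}},
$$

where we use the restricted eigen-value condition

$$
\max_{u \in \mathcal{S}_{d,s}}\frac{\|Xu\|_2}{\sqrt{n}} \le \sqrt{\phi_{\max}(s)}.
$$

To prove the bound for all $u_1, u_2 \in \mathcal{S}_{d,s}$, we consider the $\epsilon$ proper-net of $\mathcal{S}_{d,s}$ [20], denoted by $\mathcal{S}_{d,s}(\epsilon)$. Lemma 3.3 in [20] shows that the entropy of $\mathcal{S}_{d,s}$, i.e., the cardinality of $\mathcal{S}_{d,s}(\epsilon)$, denoted $N(\mathcal{S}_{d,s}, \epsilon)$, is bounded by

$$
\log N(\mathcal{S}_{d,s}, \epsilon) \le s\log\left(\frac{9d}{\epsilon s}\right).
$$

Then by using the union bound, with a probability $1 - 2\delta$ we have

$$
\max_{u_1, u_2 \in \mathcal{S}_{d,s}(\epsilon)}\frac{1}{n}\left|(Xu_1)^\top(I - A^\top A)(Xu_2)\right| \le \phi_{\max}(s)\, c\sqrt{\frac{\log\left(N^2(\mathcal{S}_{d,s}, \epsilon)/\delta\right)}{m}} \le \phi_{\max}(s)\, c\sqrt{\frac{\log(1/\delta) + 2s\log(9d/\epsilon s)}{m}}. \tag{18}
$$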

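To proceed with the proof, we need the following lemma.

Lemma 6. Let

$$
\mathcal{E}_s(u_2) = \max_{u_1 \in \mathcal{S}_{d,s}}\left|u_1^\top U u_2\right|, \qquad \mathcal{E}_s(u_2, \epsilon) = \max_{u_1 \in \mathcal{S}_{d,s}(\epsilon)}\left|u_1^\top U u_2\right|.
$$

For $\epsilon \in (0, 1/\sqrt{2})$, we have

$$
\mathcal{E}_s(u_2) \le \left(\frac{1}{1 - \sqrt{2}\epsilon}\right)\mathcal{E}_s(u_2, \epsilon).
$$

Proof. Let $U = \frac{1}{n}X^\top(I - A^\top A)X$. Following Lemma 9.2 of [15], for any $u, u' \in \mathcal{S}_{d,s}$ we can always find two vectors $v, v'$ such that

$$
u - u' = v - v', \quad \|v\|_0 \le s, \quad \|v'\|_0 \le s, \quad v^\top v' = 0.
$$

Thus

$$
\begin{aligned}
\left|\langle u - u', U u_2\rangle\right| &\le \left|\langle v, U u_2\rangle\right| + \left|\langle -v', U u_2\rangle\right| \\
&= \|v\|_2\left|\left\langle \frac{v}{\|v\|_2}, U u_2\right\rangle\right| + \|v'\|_2\left|\left\langle \frac{v'}{\|v'\|_2}, U u_2\right\rangle\right| \\
&\le (\|v\|_2 + \|v'\|_2)\,\mathcal{E}_s(u_2) \le \mathcal{E}_s(u_2)\sqrt{2}\sqrt{\|v\|_2^2 + \|v'\|_2^2} \\
&= \mathcal{E}_s(u_2)\sqrt{2}\,\|v - v'\|_2 = \mathcal{E}_s(u_2)\sqrt{2}\,\|u - u'\|_2.
\end{aligned}
$$

Then, we have

$$
\begin{aligned}
\mathcal{E}_s(u_2) = \max_{u \in \mathcal{S}_{d,s}}\left|u^\top U u_2\right| &\le \max_{u \in \mathcal{S}_{d,s}(\epsilon)}\left|u^\top U u_2\right| + \sup_{\substack{u \in \mathcal{S}_{d,s},\ u' \in \mathcal{S}_{d,s}(\epsilon) \\ \|u - u'\|_2 \le \epsilon}}\langle u - u', U u_2\rangle \\
&\le \mathcal{E}_s(u_2, \epsilon) + \sqrt{2}\epsilon\,\mathcal{E}_s(u_2),
\end{aligned}
$$

which implies

$$
\mathcal{E}_s(u_2) \le \frac{\mathcal{E}_s(u_2, \epsilon)}{1 - \sqrt{2}\epsilon}.
$$

□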
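The decomposition $u - u' = v - v'$ used in this proof can be made concrete. The sketch below shows one simple way to build such a pair, which may differ from the construction in Lemma 9.2 of [15]: since $u - u'$ has at most $2s$ nonzero entries, its support is split into two blocks of at most $s$ indices each. The helper names `split_difference` and `random_sparse_unit` are hypothetical.

```python
import numpy as np


def split_difference(u, u_prime, s):
    """Return (v, v') with u - u' = v - v', disjoint supports, each <= s nonzeros.

    Assumes u and u' are both s-sparse, so u - u' has at most 2s nonzeros.
    """
    diff = u - u_prime
    support = np.flatnonzero(diff)
    first, second = support[:s], support[s:]
    v = np.zeros_like(diff)
    v[first] = diff[first]
    v_prime = np.zeros_like(diff)
    v_prime[second] = -diff[second]   # sign flipped so that v - v' = diff
    return v, v_prime


def random_sparse_unit(rng, d, s):
    """Random s-sparse vector with unit l2 norm (an element of S_{d,s})."""
    u = np.zeros(d)
    idx = rng.choice(d, size=s, replace=False)
    u[idx] = rng.standard_normal(s)
    return u / np.linalg.norm(u)


rng = np.random.default_rng(2)
d, s = 30, 4
u = random_sparse_unit(rng, d, s)
u_prime = random_sparse_unit(rng, d, s)
v, v_prime = split_difference(u, u_prime, s)

# Orthogonality gives ||v||^2 + ||v'||^2 = ||v - v'||^2, hence
# ||v|| + ||v'|| <= sqrt(2) * ||u - u'||, the step used in the proof.
lhs = np.linalg.norm(v) + np.linalg.norm(v_prime)
rhs = np.sqrt(2.0) * np.linalg.norm(u - u_prime)
print(f"||v|| + ||v'|| = {lhs:.4f} <= sqrt(2)||u - u'|| = {rhs:.4f}")
```

Because the two blocks have disjoint supports, orthogonality of $v$ and $v'$ is automatic, which is what allows the $\sqrt{2}$ factor rather than a factor of 2.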


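Lemma 7. Let

$$
\mathcal{E}_s(\epsilon) = \max_{u_2 \in \mathcal{S}_{d,s}}\mathcal{E}_s(u_2, \epsilon) = \max_{\substack{u_1 \in \mathcal{S}_{d,s}(\epsilon) \\ u_2 \in \mathcal{S}_{d,s}}}\left|u_1^\top U u_2\right|,
$$

$$
\mathcal{E}_s(\epsilon, \epsilon) = \max_{u_2 \in \mathcal{S}_{d,s}(\epsilon)}\mathcal{E}_s(u_2, \epsilon) = \max_{u_1, u_2 \in \mathcal{S}_{d,s}(\epsilon)}\left|u_1^\top U u_2\right|.
$$

For $\epsilon \in (0, 1/\sqrt{2})$, we have

$$
\mathcal{E}_s(\epsilon) \le \left(\frac{1}{1 - \sqrt{2}\epsilon}\right)\mathcal{E}_s(\epsilon, \epsilon).
$$

The proof of the above lemma follows the same analysis as that of Lemma 6. By combining Lemma 6 and Lemma 7, we have

$$
\max_{u_2 \in \mathcal{S}_{d,s}}\mathcal{E}_s(u_2) \le \frac{\max_{u_2 \in \mathcal{S}_{d,s}}\mathcal{E}_s(u_2, \epsilon)}{1 - \sqrt{2}\epsilon} = \frac{1}{1 - \sqrt{2}\epsilon}\,\mathcal{E}_s(\epsilon) \le \left(\frac{1}{1 - \sqrt{2}\epsilon}\right)^2\mathcal{E}_s(\epsilon, \epsilon) = \left(\frac{1}{1 - \sqrt{2}\epsilon}\right)^2\max_{u_1, u_2 \in \mathcal{S}_{d,s}(\epsilon)}\left|u_1^\top U u_2\right|.
$$

By combining the above inequality with inequalities (17) and (18), we have

$$
\rho_s \le 4\max_{u_2 \in \mathcal{S}_{d,s}}\mathcal{E}_s(u_2) \le 4\left(\frac{1}{1 - \sqrt{2}\epsilon}\right)^2\phi_{\max}(s)\, c\sqrt{\frac{\log(1/\delta) + 2s\log(9d/\epsilon s)}{m}}.
$$

If we set $\epsilon = 1/(2\sqrt{2})$, we can complete the proof.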
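For concreteness, the substitution $\epsilon = 1/(2\sqrt{2})$ can be carried out explicitly. The statement of Lemma 3 is not reproduced in this appendix, so the constants below are simply what the substitution yields from the last display and should be read as a worked step rather than a quotation of the lemma:

$$
1 - \sqrt{2}\epsilon = \frac{1}{2}, \qquad \left(\frac{1}{1 - \sqrt{2}\epsilon}\right)^2 = 4, \qquad \frac{9d}{\epsilon s} = \frac{18\sqrt{2}\,d}{s},
$$

so that

$$
\rho_s \le 16\, c\,\phi_{\max}(s)\sqrt{\frac{\log(1/\delta) + 2s\log\left(18\sqrt{2}\,d/s\right)}{m}}.
$$

The bound therefore decays as $1/\sqrt{m}$ in the reduced dimension $m$ and grows only logarithmically in the ambient dimension $d$.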


B.3 Proof of Lemma 4 

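Since

$$
\frac{\|\widehat{w}_* - w_*\|_1}{\|\widehat{w}_* - w_*\|_2} \le 4\sqrt{s} = \sqrt{16s},
$$

we have

$$
\frac{\widehat{w}_* - w_*}{\|\widehat{w}_* - w_*\|_2} \in \mathcal{K}_{d,16s}.
$$

The left inequality follows from the restricted eigen-value condition and $\mathrm{conv}(\mathcal{S}_{d,s}) \subseteq \mathcal{K}_{d,s}$. For the right inequality, we note that $\mathcal{K}_{d,s} \subseteq 2\,\mathrm{conv}(\mathcal{S}_{d,s})$, hence for any $u \in \mathcal{K}_{d,s}$, we can write $u = 2\sum_i \lambda_i v_i$ with $\sum_i \lambda_i = 1$, $\lambda_i \ge 0$, and $v_i \in \mathcal{S}_{d,s}$. Writing $f(u) = \frac{1}{n}u^\top X^\top X u$, which is convex, we have

$$
\frac{1}{n}u^\top X^\top X u = f(u) = f\Big(2\sum_i \lambda_i v_i\Big) \le \sum_i \lambda_i f(2v_i) \le \frac{1}{n}\sum_i \lambda_i\, 4 v_i^\top X^\top X v_i \le 4\phi_{\max}(s).
$$

Therefore

$$
\frac{1}{n}(\widehat{w}_* - w_*)^\top X^\top X(\widehat{w}_* - w_*) \le 4\phi_{\max}(16s)\|\widehat{w}_* - w_*\|_2^2.
$$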